A systematic review of relation extraction task since the emergence of Transformers
Ringwald, Celian, Gandon, Fabien, Faron, Catherine, Michel, Franck, Abi Akl, Hanna
This article presents a systematic review of relation extraction (RE) research since the advent of Transformer-based models. Using an automated framework to collect and annotate publications, we analyze 34 surveys, 64 datasets, and 104 models published between 2019 and 2024. The review highlights methodological advances, benchmark resources, and the integration of semantic web technologies. By consolidating results across multiple dimensions, the study identifies current trends, limitations, and open challenges, offering researchers and practitioners a comprehensive reference for understanding the evolution and future directions of RE.
Pretraining Finnish ModernBERTs
Reunamo, Akseli, Peltonen, Laura-Maria, Moen, Hans, Pyysalo, Sampo
This paper reports on pretraining ModernBERT encoder models in six different sizes, ranging from 51M to 475M parameters, with a focus on limited multilingualism, emphasizing languages relevant to Finland. Our models are competitive with, or superior to, existing multilingual models. They outperform monolingual models on tasks that require a context longer than 512 tokens. We present empirical results on using different data in the final stage of training. The code and models are publicly released.
If I Could Turn Back Time: Temporal Reframing as a Historical Reasoning Task for LLMs
Bungum, Lars, Huang, Charles Yijia, Kashar, Abeer
In this study, we experiment with the ability of LLMs to perform temporal reasoning. Using a Norwegian book from 1940 containing trivia questions, we prompt the LLMs to answer the questions as if it were 1940, posing the questions in both English and Norwegian. Correct answers are often presented as sentences, so grading is done by means of LLM-as-judge, with sampled checks by a native speaker. Unexpectedly, prompting in English consistently gave better results than prompting in Norwegian. In contrast, using larger LLMs improved results, as expected. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families, as well as the largest available LLM specifically crafted for Norwegian.
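The paper's prompting setup can be sketched as a small prompt builder. The instruction wording, function name, and Norwegian translation below are illustrative assumptions for this sketch, not the authors' exact prompts:

```python
def build_prompt(question: str, year: int = 1940, language: str = "en") -> str:
    """Wrap a trivia question in a temporal-reframing instruction.

    The templates are a hypothetical reconstruction of the setup
    described in the abstract, not the paper's verbatim prompts.
    """
    templates = {
        "en": (f"Answer the following question as if the current year were {year}. "
               "Do not use any knowledge of events after that year.\n\nQuestion: "),
        "no": (f"Svar på følgende spørsmål som om året nå var {year}. "
               "Ikke bruk kunnskap om hendelser etter dette året.\n\nSpørsmål: "),
    }
    return templates[language] + question

# Same question, posed in both prompt languages:
prompt_en = build_prompt("Who is the King of Norway?", language="en")
prompt_no = build_prompt("Hvem er kongen av Norge?", language="no")
```

Grading the free-text answers against the 1940 answer key would then be delegated to a judge model, as the abstract describes.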
MultiZebraLogic: A Multilingual Logical Reasoning Benchmark
Bruun, Sofie Helene, Smart, Dan Saattrup
Measuring the full abilities of large language models (LLMs) requires benchmarks representing multiple tasks. We aim to create large, high-quality datasets for comparison of logical reasoning skills across several languages and of suitable difficulty for LLMs of various reasoning ability. We explore multiple ways of increasing difficulty. We generate zebra puzzles in multiple languages, themes, and sizes, including 14 different clue types and 8 red herring types (uninformative clues). We find puzzle sizes 2x3 and 4x5 are sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a reasoning model), respectively. Including 5 red herrings decreases o3-mini puzzle-level accuracy on 4x5 puzzles by 15$\pm$7%. Scores of o3-mini on 4x5 puzzles are not significantly affected by the use of English vs. Danish or the common houses theme vs. the country-specific smoerrebroed theme. We find no correlation between difficulty and the selected clue types. Datasets of 128+1024 puzzles are published as MultiZebraLogic in each of nine Germanic languages for sizes 2x3 and 4x5. We publish code for puzzle generation, designed for adaptability to more languages and themes.
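A zebra puzzle of the paper's smallest size (2 attributes, 3 houses) can be brute-forced by enumerating permutations. The puzzle below is invented for this sketch, and its clue types (position, same-house, neighbour, plus one red herring) are a small illustrative subset of the 14 clue types and 8 red herring types the paper describes:

```python
from itertools import permutations

NAMES = ("Alice", "Bob", "Carol")
DRINKS = ("tea", "coffee", "milk")

def solve():
    """Enumerate house assignments for a toy 2x3 zebra puzzle."""
    solutions = []
    for n in permutations(NAMES):        # n[i] = name in house i
        for d in permutations(DRINKS):   # d[i] = drink in house i
            # Clue 1 (same-house): Alice drinks tea.
            if d[n.index("Alice")] != "tea":
                continue
            # Clue 2 (neighbour): the coffee drinker lives directly right of Bob.
            if d.index("coffee") != n.index("Bob") + 1:
                continue
            # Clue 3 (position): milk is drunk in the first house.
            if d[0] != "milk":
                continue
            # Red herring (uninformative): Alice does not drink milk.
            # Already implied by Clue 1, so it never eliminates anything.
            if d[n.index("Alice")] == "milk":
                continue
            solutions.append((n, d))
    return solutions
```

A well-formed puzzle admits exactly one solution; here `solve()` returns the single assignment with Bob in the first house drinking milk, Carol drinking coffee, and Alice drinking tea.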
HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
Oepen, Stephan, Arefev, Nikolay, Aulamo, Mikko, Bañón, Marta, Buljan, Maja, Burchell, Laurie, Charpentier, Lucas, Chen, Pinzhen, Fedorova, Mariya, de Gibert, Ona, Haddow, Barry, Hajič, Jan, Helcl, Jindřich, Kutuzov, Andrey, Laippala, Veronika, Li, Zihao, Luukkonen, Risto, Malik, Bhavitvya, Mikhailov, Vladislav, Myntti, Amanda, O'Brien, Dayyán, Poláková, Lucie, Pyysalo, Sampo, Sánchez, Gema Ramírez, Siewert, Janine, Stepachev, Pavel, Tiedemann, Jörg, Vahtola, Teemu, Variš, Dušan, Vitiugin, Fedor, Vojtěchová, Tea, Zaragoza, Jaume
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied by a complete, open-source pipeline covering document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information, and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
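The exact and near-deduplication stage of a pipeline like the one described can be sketched with content hashing plus shingle-based Jaccard similarity. Web-scale pipelines use MinHash/LSH rather than the quadratic pairwise comparison shown here, and the shingle size and threshold are illustrative assumptions:

```python
import hashlib

def _shingles(text: str, n: int = 5) -> set:
    """Character n-gram shingles of whitespace-normalized text."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}

def deduplicate(docs, threshold: float = 0.8) -> list:
    """Drop exact duplicates (by hash) and near-duplicates (by Jaccard).

    Illustrative only: real pipelines replace the pairwise Jaccard
    scan with MinHash/LSH to stay sub-quadratic at web scale.
    """
    kept, hashes, shingle_sets = [], set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        h = hashlib.sha256(normalized.encode()).hexdigest()
        if h in hashes:
            continue  # exact duplicate
        s = _shingles(doc)
        if any(len(s & t) / len(s | t) >= threshold for t in shingle_sets):
            continue  # near-duplicate of a kept document
        hashes.add(h)
        shingle_sets.append(s)
        kept.append(doc)
    return kept
```

Exact duplicates collapse after case and whitespace normalization; near-duplicates (e.g. the same sentence with one word changed) fall to the Jaccard check.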
EuroSpeech: A Multilingual Speech Corpus
Pfisterer, Samuel, Grötschla, Florian, Lanzendörfer, Luca A., Yan, Florian, Wattenhofer, Roger
Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for most languages. Thus, trained models perform poorly on the majority of the supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8\% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.
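The reported 41.8% figure is a relative reduction in word error rate (WER). WER is word-level edit distance divided by reference length; a minimal implementation (not the paper's evaluation code) looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Wagner-Fischer dynamic programming over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a 41.8% relative reduction corresponds to, for example, a baseline WER of 0.120 dropping to roughly 0.070 after finetuning.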